AAN+: Generalized Average Attention Network for Accelerating Neural Transformer
Authors
Abstract
Transformer benefits from the high parallelization of attention networks in fast training, but it still suffers from slow decoding, partially due to the linear dependency O(m) of the decoder self-attention on previous target words at inference. In this paper, we propose a generalized average attention network (AAN+) aimed at speeding up decoding by reducing the dependency from O(m) to O(1). We find that the learned self-attention weights follow some patterns that can be approximated via a dynamic structure. Based on this insight, we develop AAN+, extending our previously proposed average attention (Zhang et al., 2018a, AAN) to support more general position- and content-based attention patterns. AAN+ only requires maintaining a small constant number of hidden states during decoding, ensuring its O(1) dependency. We apply AAN+ as a drop-in replacement of the decoder self-attention and conduct experiments on machine translation (with diverse language pairs), table-to-text generation, and document summarization. With masking tricks and dynamic programming, AAN+ enables decoding sentences around 20% faster without largely compromising training speed and performance. Our results further reveal the importance of localness (neighboring words) and the capability of modeling long-range dependency.
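The O(1) dependency claimed in the abstract comes from the average attention idea: the per-step context is a running average of previous target-word representations, so only one accumulator needs to be kept across decoding steps. The following is a minimal sketch of that core mechanism only (assumption: the actual AAN/AAN+ layers add gating, a feed-forward transform, and the generalized position/content patterns, which are omitted here):

```python
import numpy as np

class AverageAttentionState:
    """Incremental average-attention sketch: keeps a single running sum,
    so each decoding step costs O(1) time and memory regardless of the
    number of previously generated target words."""

    def __init__(self, dim):
        self.running_sum = np.zeros(dim)
        self.step = 0

    def update(self, y_t):
        # Consume the representation of the current target word and
        # return the averaged context over all words seen so far.
        self.step += 1
        self.running_sum += y_t
        return self.running_sum / self.step

state = AverageAttentionState(dim=4)
ctx1 = state.update(np.array([1.0, 0.0, 0.0, 0.0]))
ctx2 = state.update(np.array([0.0, 1.0, 0.0, 0.0]))
# ctx2 is the mean of the two word vectors: [0.5, 0.5, 0.0, 0.0]
```

During training the same averages can be computed for all positions at once with a masked cumulative sum, which is why the approach does not compromise training parallelism.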
Similar Papers
Accelerating Convolutional Neural Network Systems
Convolutional Neural Networks have recently been shown to be highly effective classifiers for image and speech data. Due to the large volume of data required to build useful models, and the complexity of the models themselves, efficiency has become one of the primary concerns. This work shows that frequency domain methods can be utilised to accelerate the performance of training, inference, and sl...
Accelerating Recurrent Neural Network Training
An efficient algorithm for recurrent neural network training is presented. The approach increases the training speed for tasks where the length of the input sequence may vary significantly. The proposed approach is based on optimal batch bucketing by input sequence length and data parallelization on multiple graphics processing units. The baseline training performance without sequence bucket...
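The bucketing idea in the blurb above can be sketched briefly: sequences are grouped so that items in the same batch have similar lengths, minimizing wasted computation on padding. This is a simplified illustration (assumption: the paper optimizes bucket boundaries; here lengths are just rounded up to a fixed bucket granularity):

```python
from collections import defaultdict

def bucket_by_length(sequences, bucket_size=8):
    """Group sequences into buckets keyed by padded length, where the
    padded length is the sequence length rounded up to a multiple of
    bucket_size. Batches drawn from one bucket need little padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        padded_len = ((len(seq) + bucket_size - 1) // bucket_size) * bucket_size
        buckets[padded_len].append(seq)
    return dict(buckets)

seqs = [[0] * 3, [0] * 5, [0] * 9, [0] * 16]
buckets = bucket_by_length(seqs)
# lengths 3 and 5 share the length-8 bucket; 9 and 16 go to the length-16 bucket
```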
A synergetic neural network-genetic scheme for optimal transformer construction
In this paper, a combined neural network and an evolutionary programming scheme is proposed to improve the quality of wound core distribution transformers in an industrial environment by exploiting information derived from both the construction and transformer design phase. In particular, the neural network architecture is responsible for predicting transformer iron losses prior to their assemb...
Accelerating the Super-Resolution Convolutional Neural Network
As a successful deep model applied in image super-resolution (SR), the Super-Resolution Convolutional Neural Network (SRCNN) [1, 2] has demonstrated superior performance to the previous hand-crafted models both in speed and restoration quality. However, the high computational cost still hinders it from practical usage that demands real-time performance (24 fps). In this paper, we aim at accel...
A generalized ABFT technique using a fault tolerant neural network
In this paper we first show that the standard BP algorithm cannot yield a uniform information distribution over the neural network architecture. A measure of sensitivity is defined to evaluate the fault tolerance of a neural network, and then we show that the sensitivity of a link is closely related to the amount of information that passes through it. Based on this assumption, we prove that the distribu...
Journal
Journal Title: Journal of Artificial Intelligence Research
Year: 2022
ISSN: 1076-9757, 1943-5037
DOI: https://doi.org/10.1613/jair.1.13896